Analyzing IMDb's Movie Database: A Data Science Tutorial

By Alexander Melville


Introduction

In this project, we analyze data from IMDb's movie dataset in order to determine what factors affect the gross income of a movie.


About IMDb

IMDb is one of the world's largest online movie databases, containing reviews and user ratings for over 500,000 movies.


Data Source

The dataset used in this project contains all movies on IMDb with more than 100 user ratings, as of 01/01/2020. The dataset was scraped from IMDb's website https://www.imdb.com and uploaded to the data science website Kaggle by a Kaggle user. The Kaggle page for this dataset can be accessed at https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset. Note: Kaggle is a great resource for finding tons of datasets for your future projects!


Libraries

The Python libraries used in this project are the following:
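The original import cell was likely similar to the following (the exact set and aliases are an assumption based on the libraries used later in the tutorial):

```python
import pandas as pd              # dataframes and data tidying
import numpy as np               # numeric helpers
import matplotlib.pyplot as plt  # line, scatter, and bar plots
import seaborn as sns            # scatter plots with trend lines
from sklearn.linear_model import LinearRegression  # regression models
from sklearn.model_selection import train_test_split  # train/test splits
```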


Part 1: Data Processing

Before our dataset can be analyzed, it needs to be processed and tidied as needed.


Loading the dataset

Since our dataset is already downloaded as a .csv file, it can be easily loaded into a Pandas dataframe. We'll call our dataframe "df" to keep the name short.
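The load step was presumably a single `pd.read_csv` call on the downloaded file. Since the real Kaggle file isn't bundled here, this sketch reads an in-memory sample with the same column names (the rows are illustrative, not real data):

```python
import io
import pandas as pd

# In the real project this would be something like:
#   df = pd.read_csv("IMDb movies.csv")
# Here we read a tiny illustrative sample so the snippet is self-contained.
sample_csv = io.StringIO(
    "imdb_title_id,title,year,duration,budget,worlwide_gross_income\n"
    "tt0000001,Example Movie,1999,120,$ 1000000,$ 2500000\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)
```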

A warning?

Loading our dataset into a dataframe produced the following warning: "DtypeWarning: Columns (3) have mixed types.Specify dtype option on import or set low_memory=False." The loading function involves guessing what kind of data is in a given column, and this warning means that two different datatypes have been found in the same column. Let's keep this in mind for when we tidy the dataframe.


Viewing the data

Since our table has a large number of columns, let's first change a display option so that all of the columns will be shown when we print our dataframe:
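The display tweak is a one-liner; setting the option to `None` removes the column limit entirely:

```python
import pandas as pd

# Show every column when printing a dataframe, instead of
# eliding the middle columns with "...".
pd.set_option("display.max_columns", None)
```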

Here are the first 5 rows of our dataframe (you may need to use the horizontal scrollbar to see all of the columns):

And here's our list of columns (with their data types) for convenience:

Tidying the data

We're gonna need to change some of the values in our dataframe before we can properly use it. Firstly, let's figure out what the problem was with column 3. It's the year column, so all of its values are supposed to be ints. The loading function must have seen some values that weren't ints, such as strings. Let's see if we can find non-numeric string values in the year column:
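One way to do this check is to filter for values containing anything that isn't a digit. The sample dataframe below is a stand-in for the real one, and the bad value shown is illustrative:

```python
import pandas as pd

# Toy stand-in for the real year column, with one bad row
# (illustrative value, not necessarily the real one).
df = pd.DataFrame({"year": ["1999", "2005", "TV Movie 2019"]})

# Rows whose year contains any non-digit characters:
bad_rows = df[~df["year"].astype(str).str.isnumeric()]
print(bad_rows)
```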

Huh, it turns out it's just one row with some non-numeric characters in front of the year. Even so, it's still good practice to fix issues like this by tidying every row. Here's how to remove all of the non-numeric characters in the year column, and then set the year column to be a column of ints.
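A regex replace handles this in one pass, after which the column can safely be converted to int (toy data again):

```python
import pandas as pd

df = pd.DataFrame({"year": ["1999", "2005", "TV Movie 2019"]})  # toy sample

# Keep only the digit characters in each year value, then convert to int.
df["year"] = (
    df["year"].astype(str).str.replace(r"[^0-9]", "", regex=True).astype(int)
)
```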

Now that that's done, let's change "worlwide_gross_income" (which has a typo) into a more manageable name:
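`DataFrame.rename` with a column mapping does the job (toy dataframe for illustration):

```python
import pandas as pd

df = pd.DataFrame({"worlwide_gross_income": ["$ 2500000"]})  # toy sample

# Fix the dataset's misspelled column name.
df = df.rename(columns={"worlwide_gross_income": "worldwide_gross_income"})
```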

Next, let's get rid of all of the data columns that aren't useful for analyzing what influences gross income. Such columns are those related to movie titles, people involved with the movie (this could be an important factor, but would require a much more complex analysis), values representing contemporary reviews (being contemporary, they can't affect the income), and any redundant data.
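The drop looks something like the following; the column list here is an illustrative subset, since the real call would name all of the title-, people-, and review-related columns in the full dataset:

```python
import pandas as pd

# Toy sample with a few columns of each kind.
df = pd.DataFrame({
    "title": ["Example Movie"],
    "director": ["Someone"],
    "avg_vote": [7.1],
    "year": [1999],
    "budget": ["$ 1000000"],
})

# Illustrative subset of the columns to remove.
columns_to_drop = ["title", "director", "avg_vote"]
df = df.drop(columns=columns_to_drop)
```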

Another important bit of tidying is changing budget and income values from dollar amounts to raw integers. Really, we just need to remove the dollar signs from those values. We should also set those columns to be float columns while we're at it.
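Since the values are strings like "$ 1000000", stripping everything but the digits and converting to float covers both columns (toy data below):

```python
import pandas as pd

df = pd.DataFrame({
    "budget": ["$ 1000000", "$ 50000"],
    "worldwide_gross_income": ["$ 2500000", "$ 120000"],
})  # toy sample

for col in ["budget", "worldwide_gross_income"]:
    # Remove the dollar sign (and any other non-digits), then convert.
    df[col] = df[col].str.replace(r"[^0-9]", "", regex=True).astype(float)
```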

The final bit of tidying we need to do is to remove any rows that have a NaN value for the worldwide_gross_income column, since that column is the focus of our analysis:
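`dropna` with a `subset` restricts the removal to rows missing that one column (toy data below):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "worldwide_gross_income": [2500000.0, np.nan],
    "year": [1999, 2005],
})  # toy sample

# Drop rows with no income value, since income is our target variable.
df = df.dropna(subset=["worldwide_gross_income"])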

And here's the result of our tidying:


Part 2: Exploratory Analysis & Data Visualization

Now that our data has been tidied, we can start looking at how different variables relate to one another, broadly speaking. Understanding the basic relationships between different variables will help us understand which relationships to look at further in the later stages of our data analysis. In order to understand these relationships, we need to visualize them by plotting them out.


Year vs Income

The first relationship we're gonna plot out is how the average income changes over time. Since we are graphing a single variable over time, we should use a line graph. We can easily make line graphs using the Matplotlib module:
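The plot is a groupby-then-plot pattern; the dataframe below is a toy stand-in for the tidied one:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Toy sample standing in for the tidied dataframe.
df = pd.DataFrame({
    "year": [1990, 1990, 1991, 1992],
    "worldwide_gross_income": [1e6, 3e6, 2e6, 4e6],
})

# Mean income per year, plotted as a line graph.
mean_income = df.groupby("year")["worldwide_gross_income"].mean()
plt.plot(mean_income.index, mean_income.values)
plt.xlabel("Year")
plt.ylabel("Mean worldwide gross income ($)")
plt.title("Mean income over time")
plt.savefig("year_vs_income.png")
```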

Looking at our graph we can see that generally, average income increases over time. There are a few crazy spots on our graph though; those spikes in the 1940s probably have to do with World War II. We should try to think about why our data says what it does. The general increase in income over time has to be affected by monetary inflation, but it may also have to do with the general growth of the movie industry over time. In order to figure out how much of this is caused by inflation, let's plot our average incomes adjusted for inflation:
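The adjustment multiplies each income by a per-year inflation factor before averaging. The multipliers below are hypothetical placeholders; a real analysis would pull them from a CPI table (e.g. BLS data):

```python
import pandas as pd

# Hypothetical dollars-to-2020-dollars multipliers (placeholder values;
# a real analysis would use an actual CPI table).
cpi_multiplier = {1990: 2.0, 1991: 1.9, 1992: 1.85}

df = pd.DataFrame({
    "year": [1990, 1991, 1992],
    "worldwide_gross_income": [1e6, 2e6, 3e6],
})  # toy sample

# Scale each income by its year's multiplier before averaging.
df["adjusted_income"] = df.apply(
    lambda row: row["worldwide_gross_income"] * cpi_multiplier[row["year"]],
    axis=1,
)
```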

From this graph, we can see that the increasing trend over time was largely caused by inflation.


Budget vs Income

Big Hollywood movies tend to be the most popular, so one would assume there's a correlation between budget and income. Since this relationship may not flow linearly from left to right like time does, we shouldn't plot it as a line graph. Instead, let's make a scatter plot, using Seaborn:
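Seaborn's `regplot` draws the scatter plot and the fitted trend line in one call (toy data below):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "budget": [1e6, 5e6, 2e7, 1e8],
    "worldwide_gross_income": [2e6, 8e6, 5e7, 3e8],
})  # toy sample

# Scatter plot with a fitted trend line on top.
sns.regplot(x="budget", y="worldwide_gross_income", data=df)
plt.savefig("budget_vs_income.png")
```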

Looking at all of our points, we can see a (fuzzy) positive relationship between budget and income. The trend line plotted on top of the scatter plot gives a more precise picture of this correlation. And since both budget and income are affected by inflation, we probably don't need to calculate a version of this plot adjusted for inflation. This doesn't look like a very tight correlation, but it is a correlation.


Duration vs Income

So how can duration and income be related? Perhaps big Hollywood productions tend to have a certain duration, creating a correlation between duration and income? If there is a relationship the scatter plot will show it:
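This is the same `regplot` pattern as before, just with duration on the x-axis (toy data below):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "duration": [90, 100, 120, 150],
    "worldwide_gross_income": [5e6, 1e6, 8e6, 2e6],
})  # toy sample

sns.regplot(x="duration", y="worldwide_gross_income", data=df)
plt.savefig("duration_vs_income.png")
```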

Well, this doesn't look very promising at all. The points aren't really showing much of a relationship; it's just a big blob. The trend line can't plot a meaningful path through our data; there is no way to draw a line of best fit here. Looking at this, we should assume that there's no meaningful linear relationship between duration and income in our dataset.


Country vs Income

This one's gonna be a bit harder to do, because countries are discrete values, as opposed to continuous values like time. Furthermore, a movie can have multiple countries! The best way to plot this relationship is a bar graph, but first we need a better way to represent the countries associated with each movie. One method of doing this is making a dataframe column for each country. Each movie associated with a given country will have a 1 in that country's column, and a 0 otherwise. (This approach will be especially helpful later, when we are generating regressions.) But we can't make a column for every country; that would be excessive. Let's find which countries have the most movies associated with them:
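Assuming the country column holds comma-separated lists (as it does in the Kaggle dataset), splitting, flattening, and counting gives the per-country totals (toy data below):

```python
import pandas as pd

# Toy sample: the country column holds comma-separated country lists.
df = pd.DataFrame({
    "country": ["USA", "USA, UK", "France", "UK, France, USA"],
})

# Split each list, flatten with explode, and count movies per country.
country_counts = df["country"].str.split(", ").explode().value_counts()
print(country_counts)
```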

It looks like the top 5 countries are USA, France, UK, Germany, and India, so those are the countries that we are going to use. Now, let's make country columns in our dataframe:
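One straightforward way to build the indicator columns is a substring check per country (toy data below; this is a sketch, and assumes no country name is a substring of another in the top 5):

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "USA, UK", "France"]})  # toy sample
top_countries = ["USA", "France", "UK", "Germany", "India"]

# One indicator column per country: 1 if the movie lists it, else 0.
for country in top_countries:
    df[country] = df["country"].str.contains(country, regex=False).astype(int)
```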

Let's see what our dataframe looks like now, with the new country columns:

Now that our country data is better organized, let's plot the relationship between income and country using a bar graph:
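With the indicator columns in place, the bar heights are just the mean income among movies flagged for each country (toy data below):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Toy sample with indicator columns already in place.
df = pd.DataFrame({
    "worldwide_gross_income": [3e6, 1e6, 2e6],
    "USA": [1, 0, 1],
    "France": [0, 1, 0],
})

# Mean income among movies flagged for each country.
countries = ["USA", "France"]
means = [df.loc[df[c] == 1, "worldwide_gross_income"].mean() for c in countries]
plt.bar(countries, means)
plt.ylabel("Mean worldwide gross income ($)")
plt.savefig("country_vs_income.png")
```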

So mean income varies by country, with USA being the country (of the countries we're looking at) with the highest mean income. Given that the United States has Hollywood, that's not surprising.


Genre vs Income

Plotting this is going to be very similar to plotting countries vs income. Since there are fewer genres than countries, let's try and use all of the genres:
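The genre column is also a comma-separated list, so the same explode-and-indicator pattern works, this time over every genre present (toy data below):

```python
import pandas as pd

# Toy sample: genre is a comma-separated list, like country.
df = pd.DataFrame({"genre": ["Drama", "Comedy, Drama", "Action"]})

# One indicator column per genre found anywhere in the dataset.
all_genres = sorted(df["genre"].str.split(", ").explode().unique())
for genre in all_genres:
    df[genre] = df["genre"].str.contains(genre, regex=False).astype(int)
```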

We can see that mean income varies based on genre. Interestingly, a big chunk of the genres have similar mean incomes.


Part 3: Analysis Using Machine Learning

Now that we've visualized the data, let's generate a statistical model of the factors influencing worldwide gross income. For this analysis, we are going to generate linear regressions.


Understanding Linear Regressions

A linear regression is similar to a linear equation (y = mx + b). A linear equation takes an input value, multiplies it by some scalar, and adds a constant to output an output value. A linear regression can take n input values (from the n input variables of the regression), multiply them by some scalars, add them together, add a constant, and output the sum. That sum is the prediction of what the output should be based on the input values, give or take some random error. A linear regression with n different independent variables can be visualized as the following equation:

Y = b0 + (b1 * x1) + (b2 * x2) + ... + (bn * xn) + random error

We generate linear regressions much like how we generate lines of best fit. Given a bunch of points on a grid, we can draw a line through them that represents the line of best fit. Now imagine you have a bunch of n-dimensional points. These points are the rows in your dataframe. You are using n - 1 columns as input variables, and 1 column as an output variable. Given these points, you can draw (using an algorithm in Python) a "line" of best fit. That "line" is your regression. (The trend lines we plotted earlier on top of our scatter plots were linear regressions for a bunch of 2-dimensional points.) Now, if you have a new set of input values, you can calculate where the output value should be based on your regression. Linear regressions have a random error because the models we make probably aren't taking every possible factor into account (and we don't know what factors we're missing). In order to see if our linear regression is accurate, we only use part of our dataset to generate the regression. We call that subset of our data the training set. Then we test the regression on the rest of our data (the test set) to see how accurately the regression can predict that data's output values from its input values.


Testing the Effectiveness of Linear Regressions

In order to test the effectiveness of our regressions, we will calculate an R-squared value for the training set and for the test set. An R-squared value is a measure of how accurately a regression can predict the actual output values of the data. It's how close our line of best fit gets to the actual points. An R-squared value of 1.0 means that the regression perfectly predicts the output values. The R-squared value for the training set will tell us how well our regression can predict the data used to generate it, and the R-squared value for the test set will tell us how well our regression can predict new data. If the training R-squared value is significantly larger than the test R-squared value, that means the regression is significantly better at predicting existing data than new data. This is called overfitting, and it's a problem since the purpose of a regression is to predict new output values from new input values.


More Tidying of Data

Before we can start generating linear regressions, there is a little bit of clean up we need to do. Since the linear regression generation function won't work on data containing NaN values, we need to remove any rows with NaN values in columns that we want to use in our regressions. Let's see what columns have NaN values:
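`isna().sum()` gives a per-column NaN count in one line (toy data below):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "budget": [1e6, np.nan, np.nan],
    "worldwide_gross_income": [2e6, 3e6, 4e6],
})  # toy sample

# Count the NaN values in each column.
nan_counts = df.isna().sum()
print(nan_counts)
```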

From this we can see that a massive proportion of our data has NaN for the budget column. If we want to generate regressions based on budget, we're going to have to use a much smaller dataset. However, budget looked like a promising factor in our initial analysis, so we should still pursue it. Let's make a second dataframe containing only rows with usable budget values:
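This is the same `dropna` pattern as before, assigned to a new dataframe (a name like `df_budget` is an assumption; toy data below):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "budget": [1e6, np.nan],
    "worldwide_gross_income": [2e6, 3e6],
})  # toy sample

# Second dataframe restricted to rows with a usable budget value.
df_budget = df.dropna(subset=["budget"])
```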

Now we're ready to start generating regressions.


Generating Simple Linear Regressions

Before we make more complicated regressions, let's make a simple 2D regression for budget vs income using our df_budget dataframe. To do this, we are going to use the scikit-learn module (sklearn). We've already made this regression with Seaborn, but sklearn will allow us to make more complicated regressions later on, and to more easily test those regressions.
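The pattern is split, fit, score. The synthetic data below stands in for the budget/income columns (income roughly proportional to budget plus noise), since the real dataframe isn't available here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the budget/income data.
rng = np.random.default_rng(0)
budget = rng.uniform(1e6, 1e8, size=200)
income = 2.5 * budget + rng.normal(0, 2e7, size=200)

X = budget.reshape(-1, 1)  # sklearn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(X, income, random_state=0)

# Fit on the training set; score() returns R-squared.
model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))
```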

So our regression has an R-squared of a little more than 0.5. This means that budget is a fairly good predictor for income. There is also very little difference between our training and test R-squared values, so this model should generalize well to new data. This model is based on a smaller dataset, however.


Residual Plots

In addition to calculating R-squared values, we should analyze our residuals. A residual is the difference in the Y-value between a data point and the regression line. To represent our residuals for a given regression, we can plot each of our data points, with each point's y-value changed to be that point's residual. If our model isn't missing some significant trend in the data, our residuals shouldn't follow any trend either. The line of best fit through our residuals should be y = 0. Let's plot the residuals for the regression we just made:
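Residuals are just actual minus predicted values, plotted against the input. This sketch refits on the same kind of synthetic budget/income data used above:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the budget/income data.
rng = np.random.default_rng(0)
X = rng.uniform(1e6, 1e8, size=(200, 1))
y = 2.5 * X[:, 0] + rng.normal(0, 2e7, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)  # actual minus predicted

plt.scatter(X[:, 0], residuals, s=10)
plt.axhline(0, color="red")  # residuals should center on this line
plt.xlabel("Budget ($)")
plt.ylabel("Residual ($)")
plt.savefig("residuals.png")
```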

Our residuals follow a trend line that's relatively close to y = 0, so their mean is close to zero. Furthermore, they are mostly randomly scattered. However, the residuals become densely clustered around the mean near the left part of the graph, which implies that our regression model is missing some factor. Let's make some regressions with more x-variables!


Generating Complex Linear Regressions

In order to generate regressions based on complex sets of factors, let's write a general function for making and viewing regressions:
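One way to sketch such a function (the name `run_regression` and its exact signature are assumptions, not necessarily the original author's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def run_regression(data, x_columns, y_column):
    """Fit a linear regression of y_column on x_columns and
    report training and test R-squared values."""
    X = data[x_columns]
    y = data[y_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("train R^2:", model.score(X_train, y_train))
    print("test R^2:", model.score(X_test, y_test))
    return model

# Toy usage with synthetic data:
rng = np.random.default_rng(0)
toy = pd.DataFrame({"budget": rng.uniform(1e6, 1e8, 100)})
toy["income"] = 2 * toy["budget"] + rng.normal(0, 1e7, 100)
model = run_regression(toy, ["budget"], "income")
```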


A Regression for Year, Budget, and Duration vs Income

Now let's look at the regression for year, budget, and duration vs income:

According to our R-squared values, this isn't really any more accurate than our regression based only on budget. So how effective are the models based only on year, and only on duration? Let's see:


Regressions Based on Year and on Duration

Those are both very ineffective models based on their R-squared values. Let's try a regression based on countries:


A Regression for Countries vs Income

This isn't a very useful model either! Let's try genres:


A Regression for Genres vs Income

Another useless model. It looks like budget was by far the best factor for determining income. For fun, let's try a regression based on all of our factors:


A Regression for All Factors vs Income

This final regression performs very slightly better than the budget model, but certainly not enough of an improvement to justify the amount of data it requires to make predictions. Having this many factors can also lead to overfitting, so it's usually better to stick to simpler models.


Part 4: Conclusion


Recap

In this project, we've learned how to:

- load a dataset into a Pandas dataframe and tidy it for analysis
- explore relationships between variables using visualizations
- generate linear regressions and evaluate them with R-squared values and residual plots

Our analysis of the IMDb dataset has shown us that, from the set of factors we looked at, only movie budget can be used to effectively predict the worldwide gross income of a movie.


Going Forward

Today, we've learned about linear regressions, but there are many other kinds of regression. One of these is logistic regression, which can be used to predict which of two binary states something is in. For example, if we wanted to predict whether or not a movie was a comedy, we could use a logistic regression based on factors such as year, duration, and/or other genres. Speaking of factors, you may be able to make a model more effective by using interaction features, a kind of factor that combines multiple factors. For example, let's imagine that Western movies were really popular up until the 1960s. Furthermore, let's imagine that movies made more and more money as time went on. The worldwide gross income of a movie then might depend on the year (higher year means more money) and also on the combination of year and Western (if the year is low and the movie is a Western, the movie will make more money). That combination of year and Western is an interaction feature. If you are interested in either logistic regressions or interaction features, try using them to analyze other parts of the IMDb dataset! Or if you're tired of movies, there's a ton of datasets available for free at Kaggle!